Exploration of White Wine Quality

by Alex Jenkins

Introduction

This report explores the idea of determining wine quality based on some chemical properties. Selecting a good wine can be challenging, and I imagine making good wines is even harder. It would be nice to isolate the different chemical properties inherent to excellent wines; perhaps knowing this would be beneficial in wine making.

The wine quality dataset collected by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis consists of measurements of some of the chemical properties of wines. Each observation includes a quality score from 0 (very bad) to 10 (very excellent); these scores are based on sensory data. First, I’ll explore the distributions of each property and then observe the relationships among the chemical properties and their relationships with quality. Let’s see if we can identify good wines based on these measurements.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Univariate Plots Section

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

According to the dataset, quality is a score between 0 and 10, with 0 being very poor and 10 being very excellent. The quality median and mean, rounded to the nearest score, are both 6. I’m going to use the following interpretation since the min and max of the observations are 3 and 9, respectively.

Quality Interpretation

0 - seriously?!
1 - very poor
2 - poor
3 - very bad
4 - bad
5 - below average
6 - average
7 - above average
8 - good
9 - excellent
10 - very excellent

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

This is the distribution of density, measured in grams per cubic centimeter. There are some clear outliers here; they have been removed in the second graph.

Above we have the distributions of fixed acidity, volatile acidity, and citric acid, all measured in grams per liter. Each of these has clear outliers in its right tail. These could signify very excellent or very poor wines. We’ll observe this later.

This is the distribution of pH. Its variance appears to be less than the variances of the acidity measurements, as there are no clear outliers in the right tail.

The distributions of chlorides, free SO2, total SO2, and sulfates have some outliers in their right tails also. There’s little variance in the alcohol measurements.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
## 
##   0.6   0.7   0.8   0.9  0.95     1  1.05   1.1  1.15   1.2  1.25   1.3 
##     2     7    25    39     4    93     1   146     3   187     3   147 
##  1.35   1.4  1.45   1.5  1.55   1.6  1.65   1.7  1.75   1.8  1.85   1.9 
##     2   184     4   142     2   165     2    99     1    99     3    59 
##  1.95     2  2.05   2.1   2.2  2.25   2.3  2.35   2.4   2.5   2.6  2.65 
##     2    79     1    51    56     2    42     1    41    40    33     1 
##   2.7   2.8  2.85   2.9     3   3.1  3.15   3.2   3.3   3.4   3.5   3.6 
##    38    36     1    25    17    17     1    28    23    13    31    22 
##   3.7  3.75   3.8  3.85   3.9  3.95     4   4.1   4.2  4.25   4.3  4.35 
##    12     2    21     3    17     3    19    17    31     2    19     1 
##   4.4  4.45   4.5  4.55   4.6   4.7  4.75   4.8  4.85   4.9     5   5.1 
##    14     3    33     2    40    29     5    38     1    35    43    28 
##  5.15   5.2  5.25   5.3  5.35   5.4  5.45   5.5  5.55   5.6   5.7   5.8 
##     2    29     4    17     2    23     2    13     1    16    30    23 
##  5.85   5.9  5.95     6   6.1   6.2   6.3  6.35   6.4   6.5  6.55   6.6 
##     2    19     1    23    21    31    39     1    34    26     1    30 
##  6.65   6.7  6.75   6.8  6.85   6.9  6.95     7  7.05   7.1   7.2  7.25 
##     3    25     1    28     6    20     1    31     2    36    29     2 
##   7.3  7.35   7.4  7.45   7.5   7.6   7.7  7.75   7.8  7.85   7.9  7.95 
##    19     2    40     1    30    29    34     2    41     1    32     1 
##     8   8.1  8.15   8.2  8.25   8.3   8.4  8.45   8.5  8.55   8.6  8.65 
##    32    34     1    36     2    31    13     1    24     1    27     1 
##   8.7  8.75   8.8   8.9  8.95     9  9.05   9.1  9.15   9.2  9.25   9.3 
##    18     2    22    23     1    18     1    17     2    22     2    11 
##   9.4   9.5  9.55   9.6  9.65   9.7   9.8  9.85   9.9    10 10.05  10.1 
##    10     9     1    18     4    22    16     3    18    18     3    14 
##  10.2  10.3  10.4  10.5 10.55  10.6 10.65  10.7  10.8  10.9    11  11.1 
##    23    16    25    16     1    22     1    26    17    11    19    18 
##  11.2 11.25  11.3  11.4 11.45  11.5  11.6  11.7 11.75  11.8  11.9 11.95 
##    18     2    12    14     1    11    15     8     4    35    16     3 
##    12 12.05  12.1 12.15  12.2  12.3  12.4  12.5 12.55  12.6  12.7 12.75 
##    16     1    21     4    15    13    19    16     2    16    16     1 
##  12.8 12.85  12.9    13  13.1 13.15  13.2  13.3  13.4  13.5 13.55  13.6 
##    25     4    25    19    23     1    13    16     7    10     3    12 
## 13.65  13.7  13.8  13.9    14 14.05  14.1 14.15  14.2  14.3 14.35  14.4 
##     4    21     8    18    16     1     4     1    20    17     3    17 
## 14.45  14.5 14.55  14.6  14.7 14.75  14.8  14.9 14.95    15  15.1 15.15 
##     3    17     3    13    14     2    12    14     2    13     7     1 
##  15.2 15.25  15.3  15.4  15.5 15.55  15.6  15.7 15.75  15.8  15.9    16 
##     6     1     9    17    11     6    14     9     1     6     2    10 
## 16.05  16.1  16.2  16.3  16.4 16.45  16.5 16.55  16.6 16.65  16.7 16.75 
##     6     2     7     7     5     1     3     1     2     5     5     2 
##  16.8 16.85  16.9 16.95    17 17.05  17.1  17.2  17.3 17.35  17.4 17.45 
##     4     4     3     3     1     1     5     9    14     1     2     2 
##  17.5 17.55  17.6  17.7 17.75  17.8 17.85  17.9 17.95    18 18.05  18.1 
##     8     3     2     1     4    13     5     2     3     2     3     6 
## 18.15  18.2  18.3 18.35  18.4  18.5  18.6 18.75  18.8  18.9 18.95  19.1 
##     8     3     2     4     1     1     1     4     3     1     3     1 
## 19.25  19.3 19.35  19.4 19.45  19.5  19.6  19.8  19.9 19.95 20.15  20.2 
##     3     4     1     2     3     2     1     4     1     3     1     2 
##  20.3  20.4  20.7  20.8    22  22.6  23.5 26.05  31.6  65.8 
##     1     1     2     2     2     1     1     2     2     1

Let’s take a closer look at residual sugar. Its distribution might be bi-modal.

The information provided with the dataset describes acidity as either fixed or volatile. It also states that the fixed acidity is attributed to the amount of tartaric acid in the wine, and volatile acidity is attributed to the amount of acetic acid. Citric acid is another attribute of the data but is not denoted as fixed or volatile. Because of the specificity of the fixed acidity attribute, I’ll assume citric acid is another form of fixed acidity.

Free sulfur dioxide is a component of the total sulfur dioxide. Let’s create two new variables. One will represent the ratio of free sulfur dioxide to total sulfur dioxide. The other will represent the total acidity; this will be the sum of the fixed (tartaric), citric, and volatile (acetic) acids.

The SO2 ratio appears to be centered around 0.26 and the total acidity near 7.6 g/L.
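The two derived variables can be sketched in a few lines of R. This is a minimal sketch rather than the exact code behind the report: it assumes a data frame named `wines` and uses only the first five rows shown in the `str()` output above. The column names `sulfur.dioxide.ratio` and `total.acidity` match those that appear in the correlation matrix later on.

```r
# First five rows of the dataset, copied from the str() output above.
wines <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32, 0.32),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186)
)

# Share of the total SO2 that is free (unitless, between 0 and 1).
wines$sulfur.dioxide.ratio <-
  wines$free.sulfur.dioxide / wines$total.sulfur.dioxide

# Sum of the fixed, citric, and volatile acid measurements, in g/L.
wines$total.acidity <-
  wines$fixed.acidity + wines$citric.acid + wines$volatile.acidity

print(wines[, c("sulfur.dioxide.ratio", "total.acidity")])
```

Because free SO2 is a component of total SO2, the ratio should always fall between 0 and 1, which makes it easy to sanity-check.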

Univariate Analysis

What is the structure of your dataset?

The white wine dataset consists of 4898 observations of 12 variables (alcohol, chlorides, citric acid, density, fixed acidity, free sulfur dioxide, pH, residual sugar, sulfates, total sulfur dioxide, volatile acidity, and quality), plus a row index X. I have added two more variables (sulfur dioxide ratio and total acidity). All variables except quality are continuous. Quality is an ordered categorical variable that can range from 0 to 10; in this dataset quality ranges from 3 to 9.

What is/are the main feature(s) of interest in your dataset?

I would like to know which chemical properties influence the quality of white wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The distribution of citric acid looks similar to that of density. I’m curious to see if these two variables are correlated. Also, density, citric acid, chlorides, free sulfur dioxide, and residual sugar had clear outliers. I would like to see if these outliers signify wine quality.

Did you create any new variables from existing variables in the dataset?

I created the variable sulfur dioxide ratio; it is the value of the free sulfur dioxide divided by the total sulfur dioxide. I also created the variable total acidity. It is the sum of the fixed acidity, citric acid, and volatile acidity.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distribution of residual sugar looks as though there may be two groups to consider. Some outliers were cut off to get a better look at the distribution.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
## sulfur.dioxide.ratio   -0.13945918      -0.19616085  0.016241396
## total.acidity           0.98717874       0.07157062  0.394143356
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
## sulfur.dioxide.ratio     0.05142979 -0.03321768        0.7386321024
## total.acidity            0.10473749  0.04552987       -0.0451333172
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
## sulfur.dioxide.ratio         -0.013447850 -0.06552475  0.0008012900
## total.acidity                 0.113188502  0.27560881 -0.4306513315
##                        sulphates     alcohol      quality
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831
## volatile.acidity     -0.03572815  0.06771794 -0.194722969
## citric.acid           0.06233094 -0.07572873 -0.009209091
## residual.sugar       -0.02666437 -0.45063122 -0.097576829
## chlorides             0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218
## density               0.07449315 -0.78013762 -0.307123313
## pH                    0.15595150  0.12143210  0.099427246
## sulphates             1.00000000 -0.01743277  0.053677877
## alcohol              -0.01743277  1.00000000  0.435574715
## quality               0.05367788  0.43557472  1.000000000
## sulfur.dioxide.ratio -0.02236186  0.06446642  0.197214077
## total.acidity        -0.01185225 -0.11751272 -0.131377207
##                      sulfur.dioxide.ratio total.acidity
## fixed.acidity                 -0.13945918    0.98717874
## volatile.acidity              -0.19616085    0.07157062
## citric.acid                    0.01624140    0.39414336
## residual.sugar                 0.05142979    0.10473749
## chlorides                     -0.03321768    0.04552987
## free.sulfur.dioxide            0.73863210   -0.04513332
## total.sulfur.dioxide          -0.01344785    0.11318850
## density                       -0.06552475    0.27560881
## pH                             0.00080129   -0.43065133
## sulphates                     -0.02236186   -0.01185225
## alcohol                        0.06446642   -0.11751272
## quality                        0.19721408   -0.13137721
## sulfur.dioxide.ratio           1.00000000   -0.15258717
## total.acidity                 -0.15258717    1.00000000

None of the strongest correlations in this scatterplot matrix involve quality; its largest is 0.44, with alcohol. The plots of quality versus the other properties do not show a clear dependence on any particular property. I want to observe the distributions of each chemical property according to quality scores.
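To see which properties track quality most closely, the quality column of the matrix above can be ranked by absolute value. The sketch below copies that column (rounded to four decimals) into a named vector, so it runs without the dataset. On the full data the same vector could presumably be obtained with `cor(wines)[, "quality"]`, where the data frame name `wines` is an assumption.

```r
# Correlations with quality, copied from the matrix above (rounded).
quality.cor <- c(
  fixed.acidity        = -0.1137,
  volatile.acidity     = -0.1947,
  citric.acid          = -0.0092,
  residual.sugar       = -0.0976,
  chlorides            = -0.2099,
  free.sulfur.dioxide  =  0.0082,
  total.sulfur.dioxide = -0.1747,
  density              = -0.3071,
  pH                   =  0.0994,
  sulphates            =  0.0537,
  alcohol              =  0.4356,
  sulfur.dioxide.ratio =  0.1972,
  total.acidity        = -0.1314
)

# Order by strength of association, ignoring the sign.
ranked <- quality.cor[order(abs(quality.cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Alcohol and density come out on top, followed by chlorides, the SO2 ratio, and volatile acidity.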

Quality vs Alcohol

It appears that as alcohol content increases, so does the quality score; however, the bad wines have more alcohol than the below average wines. I’m less concerned with the actual quality score than with whether a wine is good or bad. From now on, I’ll group the scores into five levels: bad (3 to 4), below average (5), average (6), above average (7), and excellent (8 to 9).

##       bad below avg   average above avg excellent 
##       183      1457      2198       880       180

There are a lot more below average wines than there are above average wines. However, the numbers of excellent and bad wine observations are almost the same. It should be interesting to see the difference between the measurements for these two groups.
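One way to build these groups is with `cut()`. This is a sketch under assumptions: the break points are inferred from the quality ranges in the grouped summaries further down (bad covers scores 3 to 4, excellent covers 8 to 9), and the name `quality.group` is hypothetical since the report does not show the column name.

```r
# Every quality score that occurs in the dataset.
quality <- c(3, 4, 5, 6, 7, 8, 9)

# cut() uses right-closed intervals by default:
# (2,4] (4,5] (5,6] (6,7] (7,9]
quality.group <- cut(
  quality,
  breaks = c(2, 4, 5, 6, 7, 9),
  labels = c("bad", "below avg", "average", "above avg", "excellent")
)

print(data.frame(quality, quality.group))
```

Applied to the full quality column, `table(quality.group)` would give counts per group like those shown above.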

There is almost a 2-point difference between the median alcohol content of bad wines (10.1%) and excellent wines (12.0%); the means differ by about 1.5 points.

Quality vs Chlorides

I want to take a closer look in order to find a line of demarcation between good and bad white wines.

The mean chloride level of bad wines (about 0.051 g/L) is greater than that of excellent wines (about 0.038 g/L). Higher levels of sodium chloride could give the wine a saltier taste.

Quality vs Density

The density of wine depends on the alcohol and sugar content, and I can see the shape of the plot is almost the inverse of the shape in the plot of alcohol and quality.

Quality vs Acidity

Fixed Acidity

There are no major differences in the means of fixed acidity levels among the groups. Fixed acidity does not appear to be a major factor in wine quality.

Citric Acid

There is a separation between the means of the citric acid levels in the excellent wine and bad wine groups, but not much difference between excellent wines and below average wines. Differences in citric acid may reflect personal taste preferences more than wine quality.

Volatile Acidity

Volatile acidity is the level of acetic acid in wine, and too much acetic acid is a bad thing. Approximately 300 mg/L of acetic acid appears to be too much.

Total Acidity

There appears to be a relationship between total acidity and quality. Because there was no clear relationship between fixed acidity or citric acid and quality, I assume the relationship between total acidity and quality is due to the levels of volatile acidity.

pH

The pH of wines rises slightly, meaning the wines become less acidic, as wine quality increases from bad to excellent (median 3.16 to 3.23).

Quality vs Sulfur Dioxide

Free Sulfur Dioxide

There is a difference of around 17 mg/L between the median free SO2 levels of bad wines (18 mg/L) and excellent wines (34.5 mg/L); the means differ by 10 mg/L.

Sulfur Dioxide Ratio

The median SO2 ratio increases from around 0.16 to 0.29 as wine quality increases from bad to excellent.

Quality vs Sulfates

The mean levels of potassium sulfate, represented by the variable sulphates, are almost the same across each quality level.

Quality vs Residual Sugar

Mean residual sugar levels vary across each quality group. There is no clear relationship. I am also interested in the relationship of some of the variables with higher correlations.

Sugar vs Alcohol

pH vs Total Acidity

Total Sulfur Dioxide vs Alcohol

I can imagine a negatively-sloped line running through each scatterplot above. These would fit with the computed correlations from above.
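The imagined line can be made concrete with a least-squares fit. The sketch below runs `lm()` on the ten rows of residual sugar and alcohol shown in the `str()` output at the top of the report. Ten rows are far too few to represent all 4898 observations, so only the sign of the slope is of interest here, and treating alcohol as the response is an assumption.

```r
# First ten rows of two columns, copied from the str() output.
first.rows <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Fit alcohol as a linear function of residual sugar.
fit <- lm(alcohol ~ residual.sugar, data = first.rows)

# A negative slope means alcohol tends to fall as sugar rises.
print(coef(fit))
```

On the full dataset the same call would be made with the complete data frame in place of `first.rows`.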

Sulfates vs Total Sulfur Dioxide

Potassium sulfate (represented by the sulphates variable) is an additive that contributes to the sulfur dioxide levels in the wine. I want to examine that relationship.

I can imagine a positively-sloped line running through the relationship between sulfates and total SO2. However, the correlation between the two is low (0.13).

## wines[, 16]: bad
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 4.200   Min.   :0.110    Min.   :0.0000   Min.   : 0.700  
##  1st Qu.: 6.400   1st Qu.:0.260    1st Qu.:0.2050   1st Qu.: 1.350  
##  Median : 6.900   Median :0.320    Median :0.3000   Median : 2.700  
##  Mean   : 7.181   Mean   :0.376    Mean   :0.3077   Mean   : 4.821  
##  3rd Qu.: 7.650   3rd Qu.:0.460    3rd Qu.:0.4000   3rd Qu.: 7.500  
##  Max.   :11.800   Max.   :1.100    Max.   :0.8800   Max.   :17.550  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01300   Min.   :  3.00      Min.   : 10.0       
##  1st Qu.:0.03750   1st Qu.:  9.00      1st Qu.: 85.5       
##  Median :0.04600   Median : 18.00      Median :119.0       
##  Mean   :0.05056   Mean   : 26.63      Mean   :130.2       
##  3rd Qu.:0.05400   3rd Qu.: 33.50      3rd Qu.:177.0       
##  Max.   :0.29000   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates        alcohol     
##  Min.   :0.9892   Min.   :2.830   Min.   :0.250   Min.   : 8.00  
##  1st Qu.:0.9926   1st Qu.:3.060   1st Qu.:0.380   1st Qu.: 9.40  
##  Median :0.9941   Median :3.160   Median :0.470   Median :10.10  
##  Mean   :0.9943   Mean   :3.183   Mean   :0.476   Mean   :10.17  
##  3rd Qu.:0.9960   3rd Qu.:3.285   3rd Qu.:0.540   3rd Qu.:10.80  
##  Max.   :1.0004   Max.   :3.720   Max.   :0.870   Max.   :13.50  
##     quality      sulfur.dioxide.ratio total.acidity   
##  Min.   :3.000   Min.   :0.03371      Min.   : 4.645  
##  1st Qu.:4.000   1st Qu.:0.10543      1st Qu.: 7.020  
##  Median :4.000   Median :0.16129      Median : 7.630  
##  Mean   :3.891   Mean   :0.18883      Mean   : 7.865  
##  3rd Qu.:4.000   3rd Qu.:0.23852      3rd Qu.: 8.450  
##  Max.   :4.000   Max.   :0.65682      Max.   :12.410  
## -------------------------------------------------------- 
## wines[, 16]: below avg
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 4.500   Min.   :0.100    Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.240    1st Qu.:0.2400   1st Qu.: 1.800  
##  Median : 6.800   Median :0.280    Median :0.3200   Median : 7.000  
##  Mean   : 6.934   Mean   :0.302    Mean   :0.3377   Mean   : 7.335  
##  3rd Qu.: 7.400   3rd Qu.:0.340    3rd Qu.:0.4100   3rd Qu.:11.500  
##  Max.   :10.300   Max.   :0.905    Max.   :1.0000   Max.   :23.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.04000   1st Qu.: 22.00      1st Qu.:121.0       
##  Median :0.04700   Median : 35.00      Median :151.0       
##  Mean   :0.05155   Mean   : 36.43      Mean   :150.9       
##  3rd Qu.:0.05300   3rd Qu.: 50.00      3rd Qu.:182.0       
##  Max.   :0.34600   Max.   :131.00      Max.   :344.0       
##     density             pH          sulphates         alcohol      
##  Min.   :0.9872   Min.   :2.790   Min.   :0.2700   Min.   : 8.000  
##  1st Qu.:0.9933   1st Qu.:3.080   1st Qu.:0.4200   1st Qu.: 9.200  
##  Median :0.9953   Median :3.160   Median :0.4700   Median : 9.500  
##  Mean   :0.9953   Mean   :3.169   Mean   :0.4822   Mean   : 9.809  
##  3rd Qu.:0.9972   3rd Qu.:3.240   3rd Qu.:0.5300   3rd Qu.:10.300  
##  Max.   :1.0024   Max.   :3.790   Max.   :0.8800   Max.   :13.600  
##     quality  sulfur.dioxide.ratio total.acidity   
##  Min.   :5   Min.   :0.02362      Min.   : 4.900  
##  1st Qu.:5   1st Qu.:0.17188      1st Qu.: 6.970  
##  Median :5   Median :0.23810      Median : 7.500  
##  Mean   :5   Mean   :0.23772      Mean   : 7.574  
##  3rd Qu.:5   3rd Qu.:0.29646      3rd Qu.: 8.120  
##  Max.   :5   Max.   :0.65000      Max.   :11.030  
## -------------------------------------------------------- 
## wines[, 16]: average
##  fixed.acidity    volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.000   Min.   : 0.700  
##  1st Qu.: 6.300   1st Qu.:0.2000   1st Qu.:0.270   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2500   Median :0.320   Median : 5.300  
##  Mean   : 6.838   Mean   :0.2606   Mean   :0.338   Mean   : 6.442  
##  3rd Qu.: 7.300   3rd Qu.:0.3000   3rd Qu.:0.380   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :0.9650   Max.   :1.660   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01500   Min.   :  3.00      Min.   : 18.0       
##  1st Qu.:0.03600   1st Qu.: 24.00      1st Qu.:107.2       
##  Median :0.04300   Median : 34.00      Median :132.0       
##  Mean   :0.04522   Mean   : 35.65      Mean   :137.0       
##  3rd Qu.:0.04900   3rd Qu.: 46.00      3rd Qu.:164.0       
##  Max.   :0.25500   Max.   :112.00      Max.   :294.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9876   Min.   :2.720   Min.   :0.2300   Min.   : 8.50  
##  1st Qu.:0.9917   1st Qu.:3.080   1st Qu.:0.4100   1st Qu.: 9.60  
##  Median :0.9937   Median :3.180   Median :0.4800   Median :10.50  
##  Mean   :0.9940   Mean   :3.189   Mean   :0.4911   Mean   :10.58  
##  3rd Qu.:0.9959   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.810   Max.   :1.0600   Max.   :14.00  
##     quality  sulfur.dioxide.ratio total.acidity   
##  Min.   :6   Min.   :0.03361      Min.   : 4.130  
##  1st Qu.:6   1st Qu.:0.19836      1st Qu.: 6.860  
##  Median :6   Median :0.25862      Median : 7.370  
##  Mean   :6   Mean   :0.26217      Mean   : 7.436  
##  3rd Qu.:6   3rd Qu.:0.32046      3rd Qu.: 7.940  
##  Max.   :6   Max.   :0.71053      Max.   :14.960  
## -------------------------------------------------------- 
## wines[, 16]: above avg
##  fixed.acidity   volatile.acidity  citric.acid     residual.sugar  
##  Min.   :4.200   Min.   :0.0800   Min.   :0.0100   Min.   : 0.900  
##  1st Qu.:6.200   1st Qu.:0.1900   1st Qu.:0.2800   1st Qu.: 1.700  
##  Median :6.700   Median :0.2500   Median :0.3100   Median : 3.650  
##  Mean   :6.735   Mean   :0.2628   Mean   :0.3256   Mean   : 5.186  
##  3rd Qu.:7.200   3rd Qu.:0.3200   3rd Qu.:0.3600   3rd Qu.: 7.325  
##  Max.   :9.200   Max.   :0.7600   Max.   :0.7400   Max.   :19.250  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   :  5.00      Min.   : 34.0       
##  1st Qu.:0.03100   1st Qu.: 25.00      1st Qu.:101.0       
##  Median :0.03700   Median : 33.00      Median :122.0       
##  Mean   :0.03819   Mean   : 34.13      Mean   :125.1       
##  3rd Qu.:0.04400   3rd Qu.: 41.00      3rd Qu.:144.2       
##  Max.   :0.13500   Max.   :108.00      Max.   :229.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.840   Min.   :0.2200   Min.   : 8.60  
##  1st Qu.:0.9906   1st Qu.:3.100   1st Qu.:0.4100   1st Qu.:10.60  
##  Median :0.9918   Median :3.200   Median :0.4800   Median :11.40  
##  Mean   :0.9925   Mean   :3.214   Mean   :0.5031   Mean   :11.37  
##  3rd Qu.:0.9937   3rd Qu.:3.320   3rd Qu.:0.5800   3rd Qu.:12.30  
##  Max.   :1.0004   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality  sulfur.dioxide.ratio total.acidity  
##  Min.   :7   Min.   :0.0500       Min.   :4.730  
##  1st Qu.:7   1st Qu.:0.2118       1st Qu.:6.810  
##  Median :7   Median :0.2717       Median :7.310  
##  Mean   :7   Mean   :0.2757       Mean   :7.323  
##  3rd Qu.:7   3rd Qu.:0.3333       3rd Qu.:7.820  
##  Max.   :7   Max.   :0.6429       Max.   :9.870  
## -------------------------------------------------------- 
## wines[, 16]: excellent
##  fixed.acidity   volatile.acidity  citric.acid     residual.sugar  
##  Min.   :3.900   Min.   :0.120    Min.   :0.0400   Min.   : 0.800  
##  1st Qu.:6.200   1st Qu.:0.200    1st Qu.:0.2800   1st Qu.: 2.075  
##  Median :6.800   Median :0.260    Median :0.3200   Median : 4.300  
##  Mean   :6.678   Mean   :0.278    Mean   :0.3282   Mean   : 5.628  
##  3rd Qu.:7.300   3rd Qu.:0.330    3rd Qu.:0.3600   3rd Qu.: 8.150  
##  Max.   :9.100   Max.   :0.660    Max.   :0.7400   Max.   :14.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01400   Min.   :  6.00      Min.   : 59.0       
##  1st Qu.:0.03000   1st Qu.: 28.00      1st Qu.:102.8       
##  Median :0.03550   Median : 34.50      Median :122.0       
##  Mean   :0.03801   Mean   : 36.63      Mean   :125.9       
##  3rd Qu.:0.04400   3rd Qu.: 44.25      3rd Qu.:148.5       
##  Max.   :0.12100   Max.   :105.00      Max.   :212.5       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.940   Min.   :0.2500   Min.   : 8.50  
##  1st Qu.:0.9903   1st Qu.:3.127   1st Qu.:0.3800   1st Qu.:11.00  
##  Median :0.9916   Median :3.230   Median :0.4600   Median :12.00  
##  Mean   :0.9922   Mean   :3.221   Mean   :0.4857   Mean   :11.65  
##  3rd Qu.:0.9935   3rd Qu.:3.330   3rd Qu.:0.5825   3rd Qu.:12.60  
##  Max.   :1.0006   Max.   :3.590   Max.   :0.9500   Max.   :14.00  
##     quality      sulfur.dioxide.ratio total.acidity  
##  Min.   :8.000   Min.   :0.07895      Min.   :4.525  
##  1st Qu.:8.000   1st Qu.:0.22308      1st Qu.:6.855  
##  Median :8.000   Median :0.28767      Median :7.385  
##  Mean   :8.028   Mean   :0.28929      Mean   :7.284  
##  3rd Qu.:8.000   3rd Qu.:0.33621      3rd Qu.:7.768  
##  Max.   :9.000   Max.   :0.60377      Max.   :9.820

We can compare the statistical values of each variable according to the wine quality and notice values that differentiate excellent wines from poor wines.
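The comparison can be condensed into a small table of differences. The sketch below copies the bad and excellent medians for five variables from the grouped summaries above, so it runs without the dataset. On the full data, a call such as `aggregate(alcohol ~ quality.group, data = wines, FUN = median)` should give the per-group medians, where `wines` and `quality.group` are assumed names.

```r
# Medians copied from the "bad" and "excellent" summaries above.
medians <- data.frame(
  variable  = c("alcohol", "chlorides", "volatile.acidity",
                "free.sulfur.dioxide", "sulfur.dioxide.ratio"),
  bad       = c(10.10, 0.0460, 0.32, 18.0, 0.16129),
  excellent = c(12.00, 0.0355, 0.26, 34.5, 0.28767)
)

# Positive values mean excellent wines have more of that property.
medians$difference <- medians$excellent - medians$bad

print(medians)
```

Alcohol, free SO2, and the SO2 ratio are higher in excellent wines, while chlorides and volatile acidity are lower.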

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Taking a look at the distribution of each variable according to the quality grade highlights some of the differences between high and low quality wines. High quality wines when compared to bad wines have:
  • higher alcohol content (above 11%)
  • lower chlorides (around 35 mg/L)
  • lower acetic acid (around 260 mg/L)
  • higher free SO2 (around 35 mg/L)
  • higher free SO2 to total SO2 ratio

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I was curious about the relationship between sulfates and the sulfur dioxide levels. I was surprised there was not a more evident relationship. I also observed the correlations between sugar and alcohol, pH and total acidity, and alcohol and total sulfur dioxide. The correlations computed earlier were evident in the graphs.

What was the strongest relationship you found?

The strongest correlation I found was between fixed acidity and total acidity (0.99), but that is because of how the variable was created. After that, density and residual sugar have a correlation of 0.84. Other correlations of note include:
  • alcohol and density (-0.78)
  • total sulfur dioxide and density (0.53)
  • sugar and alcohol (-0.45)
  • alcohol and total sulfur dioxide (-0.45)
  • alcohol and quality (0.44)
  • pH and total acidity (-0.43)

Some of these relations are expected since density depends on the alcohol and sugar content in wine. The acidity and basicity are described by the pH of wine. Alcohol and quality have a correlation of 0.44; good wines tend to have more alcohol than bad wines.

    Multivariate Plots Section

    Quality vs Alcohol and Volatile Acidity

    The first plot shows how the different combinations of alcohol and volatile acidity levels contribute to wine quality. The second plot shows the location of each cluster.

    Quality vs Alcohol, Volatile Acidity, and SO2 Ratio

    This shows the location of the alcohol-volatile acidity clusters in relation to the SO2 ratio.
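The SO2 ratio is the ratio of free SO2 to total SO2. As a small sketch, it can be computed from the first ten rows printed in the `str()` output at the top of the report; the column name `so2.ratio` is a hypothetical choice.

```r
# First ten rows of the two sulfur dioxide columns from the str() output (mg/L).
wines <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Share of the total SO2 that is present in free form.
wines$so2.ratio <- wines$free.sulfur.dioxide / wines$total.sulfur.dioxide

print(round(wines$so2.ratio, 3))
```

Since free SO2 is part of total SO2, the ratio always falls between 0 and 1.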

    Quality vs Chlorides, Residual Sugar, and Alcohol Ratio

    This graph shows the clusters in the relationship between chlorides and residual sugar. The observations are colored by alcohol content, which helps to distinguish high and low quality wines.

    Quality vs Free SO2, Alcohol, and SO2 Ratio

    This graph shows the relationship among alcohol, free SO2, and the SO2 ratio by quality. Excellent wines cluster around 12% alcohol and a free SO2 level of about 30 mg/L.

    Quality vs Chlorides, Alcohol, and SO2 Ratio

    Alcohol seems to be the better chemical property for distinguishing cluster centers. Chloride levels of around 40 mg/L and 12% alcohol are present in excellent wines.

    Quality vs Free SO2, Volatile Acidity, and Alcohol Ratio

    There’s no clear division in this graph.

    Quality vs Free SO2, Chlorides, and Alcohol Ratio

    Alcohol content around 12% is a good indicator of good quality wines. The red observations in the bad and below average plots indicate that combinations of less than 0.03 g/L of chlorides and less than 20 mg/L of free SO2 are not good for wines.

    Multivariate Analysis

    Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

    Alcohol strengthens many of the variables when looking at wine quality. Quality clusters by alcohol content. The plots in the Multivariate section show clusters of wine quality for different combinations of alcohol content, chlorides, free SO2, SO2 ratio, and volatile acidity.

    Were there any interesting or surprising interactions between features?

    Low chlorides and high alcohol interact well when using residual sugar to differentiate high quality wines. In the Bivariate section, wine quality increased as the sulfur dioxide ratio rose toward 40%. The ratio's effect on wine quality was less evident when plotted against alcohol and chlorides.


    Final Plots and Summary

    Plot One

    Description One

    Excellent wines have the highest median alcohol content, and the lower quality groups have lower medians. The median alcohol content by volume is 10.1, 9.5, 10.5, 11.4, and 12.0 percent for bad, below average, average, above average, and excellent wines, respectively. That leaves a margin of 1.9 percentage points between bad wines and excellent wines.
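The margin can be checked directly from the medians quoted above. This short sketch only restates those numbers; the vector name `median.alcohol` is a hypothetical label.

```r
# Median alcohol content (% by volume) per quality group, as quoted above.
median.alcohol <- c(
  "bad"           = 10.1,
  "below average" = 9.5,
  "average"       = 10.5,
  "above average" = 11.4,
  "excellent"     = 12.0
)

# Gap between the excellent and bad groups, in percentage points.
margin <- median.alcohol[["excellent"]] - median.alcohol[["bad"]]
print(round(margin, 1))

# Group with the lowest median alcohol content.
lowest.group <- names(which.min(median.alcohol))
print(lowest.group)
```

Note that the lowest median belongs to the below average group rather than the bad group, so the trend is not strictly increasing at the low end of the scale.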

    Plot Two

    Description Two

    Excellent wines have the highest median SO2 ratio, and the lower quality groups have lower medians. The median SO2 ratio is 0.16, 0.24, 0.26, 0.27, and 0.29 for bad, below average, average, above average, and excellent wines, respectively. As wine quality increases, so does the median SO2 ratio. A linear model could be constructed to predict the quality of wines using the ratio of free SO2 to total SO2.
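As a rough sketch of that idea, the five group medians quoted above can be fed to `lm()`. This is only an illustration under strong assumptions: the quality groups are coded 1 to 5 as if they were evenly spaced, and five summary points stand in for the individual wines that a real model would be fit on.

```r
# Quality groups coded 1 (bad) to 5 (excellent); an assumed numeric coding.
group.level <- 1:5

# Median SO2 ratio per quality group, as quoted above.
median.so2.ratio <- c(0.16, 0.24, 0.26, 0.27, 0.29)

# Simple linear regression of group level on the median SO2 ratio.
fit <- lm(group.level ~ median.so2.ratio)
print(coef(fit))

# Predicted group level for a hypothetical wine with an SO2 ratio of 0.25.
pred <- predict(fit, newdata = data.frame(median.so2.ratio = 0.25))
print(pred)
```

A positive slope is consistent with the trend in the plot, but a fit on group medians hides the spread within each group, so it overstates how well the ratio alone would predict quality.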

    Plot Three

    Quality vs Alcohol, Volatile Acidity, and SO2 Ratio

    Description Three

    High quality wines and bad wines cluster in two different regions. The mean alcohol and volatile acidity levels in excellent wines are 11.65% and 278 mg/L, respectively. Excellent wines are clustered above 11% alcohol and near 300 mg/L of volatile acidity (acetic acid). These wines are also more likely to have higher SO2 ratios; their mean SO2 ratio is 0.29. The alcohol content in bad wines is centered around a mean of 10.17% by volume, and their mean SO2 ratio is 0.19. The mean volatile acidity (acetic acid) level in bad wines is 376 mg/L. Bad wines are clustered below 11% alcohol and are more likely to have higher levels of volatile acidity and lower SO2 ratios.


    Reflection

    The white wines dataset contains information for almost 5,000 wines. I wanted to discern which variables affect the quality of white wines. Initially, I was confused by the variables, their names, and the information provided about the dataset. After investigating the distributions of the individual variables and the outliers in the dataset, I created two new variables representing relationships among existing variables and explored the relationships between certain variables using plots. Then I observed the quality of wines across each variable. I was disappointed that the scatterplot matrix did not show high correlations apart from those involving the variables I created. I wanted to see clear relationships between quality and the other variables, but the scatterplots showed a lot of overplotting. There were also too many quality levels for distinguishing good wines from bad ones, so I combined some levels into groups ranging from bad to excellent. Plotting the distribution of each variable by these quality groups showed the relationships I wanted to see. I noticed trends in the effect that alcohol, chlorides, volatile acidity, and free SO2 have on wine quality. I was surprised that sulfates did not have a higher correlation with free SO2 and total SO2, and I was curious to see some correlation between alcohol content and the amount of residual sugar.

    The multivariate analysis suggests that, in further exploration, a linear model could be fit to the data. Alcohol and chlorides were good indicators of wine quality. Comparing the 12% alcohol level typical of good wines with the other variables showed the negative effect their levels could have on wine quality. For example, the plot of quality vs alcohol, chlorides, and free SO2 shows the effect of too little alcohol and too much chloride on wine quality. There is not a lot of margin in some chemical properties when comparing good and bad wines. The task will be a difficult one, but perhaps white wine quality can be predicted using some of the variables mentioned previously.
